Abstract

Convolutional Networks (ConvNets) have recently improved image recognition performance
thanks to end-to-end learning of deep feed-forward models from raw pixels.
Deep learning is a marked departure from the previous state of the art, the Fisher Vector
(FV), which relied on gradient-based encoding of local hand-crafted features. In this
paper, we discuss a novel connection between these two approaches. First, we show
that one can derive gradient representations from ConvNets in a similar fashion to the
FV. Second, we show that this gradient representation actually corresponds to a structured
matrix that allows for efficient similarity computation. We experimentally study
the benefits of transferring this representation over the outputs of ConvNet layers, and
find consistent improvements on the Pascal VOC 2007 and 2012 datasets.


1 Introduction

Image classification involves describing images with pre-determined labels. One of the first
breakthroughs towards solving this problem was the bag-of-visual-words (BOV) [□, E3].
While the BOV simply involves counting the number of occurrences of quantized local features,
approaches that encode higher-order statistics such as the Fisher Vector (FV) [EB,
iza] led to state-of-the-art image classification results [□, BZD]. In particular, such higher-order
encodings were used by the leading teams in the 2010 and 2011 editions of the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) [S, ED]. FV-based approaches were,
however, outperformed in 2012 by the work of Krizhevsky et al. [ED] based on Convolutional
Networks (ConvNets) [EZD] trained in a supervised fashion on large amounts of labeled data.
These models are feed-forward architectures involving multiple computational layers that
alternate linear operations, e.g. convolutions, and non-linear operations, e.g. rectified linear
units (ReLU). The end-to-end training of the large number of parameters inside ConvNets,
from pixels to the specific end-task, is a key to their success. Since then, ConvNets, including
improved architectures [E3, E3I, SI], have consistently outperformed all other alternatives in
subsequent editions of ILSVRC. ConvNets also have remarkable transferability properties
when used as “universal” feature extractors: if one feeds an image to a ConvNet, the
output of intermediate layers can be used as a representation of this image, which is typically fed
to linear classifiers. To the best of our knowledge, this heuristic does not rest on strong
theoretical ground, but has been experimentally shown to work well in practice [0, □, IZ3, IZ3, SI].

Figure 1: AlexNet architecture [ED], composed of convolutional layers followed by fully connected
layers. $C_k$ are the parameters (4D tensors) of the convolutional layers.
$W_k$ are the parameters (matrices) of the fully connected layers. Black (resp. red) arrows represent the
information flow during the forward (resp. backward) pass. Inspired by the Fisher Kernel [O], we
study the use of gradient-related information (the blue matrices) as transferable representations.

Although ConvNets and FV approaches differ significantly, several works tried to combine
their benefits [DJI, Q, E3, EE]. Our work also attempts to get the best of both FV and
ConvNet worlds. Our primary contribution is a novel approach to extract a transferable
representation of an image given a pre-trained ConvNet. We draw inspiration from the FV,
which is based on the theoretically well-founded Fisher Kernel (FK) proposed by Jaakkola
and Haussler [O]. The FK involves deriving a kernel from an underlying generative model
of the data by taking the gradient of the log-likelihood with respect to the model parameters.
In a similar manner, given an unlabeled image, we propose to compute the gradient of
a cross-entropy criterion measured between the predicted class probabilities and an equal
probability output. This gradient with respect to the parameters of the fully connected layers
yields very high-dimensional representations (cf. Figure 1). Our second contribution consists
in leveraging the special structure of this gradient representation to design an efficient
kernel. We show that our representation actually corresponds to a rank-1 matrix, for which
the trace kernel can be efficiently computed. Furthermore, this kernel decomposes in our
case into the product of two simpler kernels: the standard one on forward-pass features, and
a second one on quantities efficiently computed by back-propagation.

The remainder of this article is organized as follows. In section 2, we review related 
works. In section 3, we provide more background on the FK and ConvNets. In section 4, 
we introduce our novel hybrid ConvNet-gradient representation as well as our associated 
efficient kernel. Finally, we provide experimental results on the PASCAL VOC 2007 and 
2012 benchmarks in section 5, showing that our representation consistently transfers better 
than the standard forward pass features. 


2 Related Work 


Hybrid techniques. Several works have proposed to combine the benefits of deep learning
with "shallow" bag-of-patches representations based on higher-order statistics such as the
FV [IZa, UB] or the VLAD [HE]. Simonyan et al. [El] propose to stack multiple FV layers,
each defined as a set of five operations: i) FV encoding, ii) supervised dimensionality reduction,
iii) spatial stacking, iv) $\ell_2$ normalization and v) PCA dimensionality reduction. They
show that, when combined with the original FV, such networks lead to significant performance
improvements on ImageNet. Peng et al. [IZ3] proposed a similar idea, but for action
recognition. Alternatively, Sydorov et al. [EE] improve on the FV framework by jointly
learning the SVM classifier and the GMM visual vocabulary. Conceptually, this is similar to
back-propagation as used to learn neural network parameters: the gradients corresponding
to the SVM layer are back-propagated to compute the gradients with respect to the GMM
parameters. Peng et al. [El] proposed a similar idea for the VLAD [HE] descriptor. Finally,
Gong et al. [HI] address the lack of geometric invariance in ConvNets with a hybrid approach.
They extract mid-level ConvNet features from large patches, embed them using the
VLAD encoding, and aggregate them at multiple scales. This leads to competitive results
on a number of classification tasks. While our goal - getting the best of the FV and deep
frameworks - is shared with these previous works, we differ significantly, as we are the first
to propose to derive gradient features from deep nets.

Deriving representations from pre-trained classifiers. Classemes [E3, SE] is a common
image representation from a set of classifiers obtained by simply stacking classifier
scores. Dimensionality reduction is generally applied on classeme features [DU], but learning
the classification and the dimensionality reduction separately is suboptimal [O]. Several
works [□, O, O] learn an optimal embedding of images in a low-dimensional space via classifiers
with an intermediate hidden layer. The first layer can be understood as a supervised
dimensionality reduction step, while the second one can be interpreted as a set of classifiers
in the intermediate space. A new image is represented as the output of this intermediate layer,
discarding the classifiers. A natural extension is to learn deeper architectures, i.e. architectures
with more than one hidden layer, and to use the output of these intermediate layers as
features for the new tasks. Krizhevsky et al. [ED] proposed to learn end-to-end a deep classifier
based on the ConvNet architecture of LeCun et al. [EID]. They showed qualitatively that
the output of the penultimate layer could be used for image retrieval. This finding was quantitatively
validated for a number of tasks, including image classification [□, 0, E3, E0, SI],
image retrieval [ffl, E0], object detection [O], and action recognition [E3]. The choice of the
layer(s) whose output should be used for representation purposes depends on the problem
at hand. As observed by Yosinski et al. [S3], this choice should be driven by the distance
between the base task (the one used to learn the classifier) and the target task. In this paper, we
show that this heuristic of using the output of an intermediate-level fully connected layer as
image representation can be related to the application of the Fisher Kernel idea to ConvNets.


3 Background on the Fisher Kernel and ConvNets 

3.1 Fisher Kernel 

The Fisher Kernel (FK) is a generic principle introduced to combine the benefits of generative
and discriminative models for pattern recognition. Let $X$ be a sample, and let $u_\theta$ be a
probability density function that models the generative process of $X$, where $\theta$ denotes the
vector of parameters of $u_\theta$. In statistics, the score function is given by the gradient of the
log-likelihood of the data on the model:

$$\varphi_\theta^{FK}(X) = \nabla_\theta \log u_\theta(X). \tag{1}$$

This gradient describes the contribution of the individual parameters to the generative process.
Jaakkola and Haussler [O] proposed to measure the similarity between two samples $X$
and $Y$ using the Fisher Kernel (FK), which is defined as:

$$K_{FK}(X, Y) = \varphi_\theta^{FK}(X)^\top F_\theta^{-1}\, \varphi_\theta^{FK}(Y), \tag{2}$$

where $F_\theta$ is the Fisher Information Matrix, usually approximated by the identity matrix [O].
One of the benefits of the FK framework is that it comes with guarantees. The FK is indeed
asymptotically at least as good as the MAP decision rule, when assuming that the classification
label is included in the generative model as a latent variable (theorem 1 in [O]). Some
extensions make the dependence of the kernel on the classification labels explicit. These include
the likelihood kernel [O], which involves one generative model per class, and which
consists in computing one FK for each generative model (and consequently for each class).
They also include the likelihood ratio kernel [EE], which is tailored to the two-class problem,
and which involves computing the gradient of the log of the ratio between the two
class likelihoods. Given two classes denoted $c_1$ and $c_2$ with class-conditional probability
density functions $p(\cdot|c_1)$ and $p(\cdot|c_2)$ and with collective parameters $\theta$, this yields:

$$\varphi_\theta^{LR}(X) = \nabla_\theta \log \frac{p(X|c_1)}{p(X|c_2)}. \tag{3}$$

The likelihood ratio kernel is supported by strong experimental evidence [EE] and theory [E3].
In section 4, we extend it to derive a gradient representation from a ConvNet model.
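
As a concrete illustration of equations (1) and (2), the sketch below computes Fisher scores for a univariate Gaussian with parameters $\theta = (\mu, \sigma)$ and compares two samples using the identity approximation of the Fisher Information Matrix. The Gaussian model, the function names, and the treatment of a sample as a set of i.i.d. observations are assumptions made purely for illustration; they are not part of the method studied here.

```python
import numpy as np

def fisher_score(x, mu, sigma):
    """Gradient of the log-likelihood of observations x under N(mu, sigma^2),
    taken with respect to theta = (mu, sigma), as in equation (1)."""
    x = np.asarray(x, dtype=float)
    d_mu = np.sum((x - mu) / sigma**2)
    d_sigma = np.sum(((x - mu) ** 2 - sigma**2) / sigma**3)
    return np.array([d_mu, d_sigma])

def fisher_kernel(x, y, mu, sigma):
    """Equation (2) with the Fisher Information Matrix replaced by the identity."""
    return float(fisher_score(x, mu, sigma) @ fisher_score(y, mu, sigma))

# Two single-observation samples compared under a standard normal model.
similarity = fisher_kernel([1.0], [2.0], mu=0.0, sigma=1.0)
```

Two samples are similar under this kernel when they pull the model parameters in the same direction, which is the intuition carried over to ConvNets in section 4.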


3.2 Convolutional Networks 

Convolutional Networks (ConvNets) [ED] are the de facto state-of-the-art models for image
recognition since the work of Krizhevsky et al. [EE]. This class of deep learning models
relies on a feed-forward architecture typically composed of a stack of convolutional layers
followed by a stack of fully connected layers (see Figure 1 for the standard AlexNet [ED]
architecture). A convolutional layer is parametrized by a 4D tensor representing a stack
of 3D filters. During the forward pass, these filters are run in a sliding window fashion
across the output of the previous layer (or the image itself for the first layer) in order to
produce a 3D tensor: the stack of per-filter activation maps. These activation maps then pass
through a non-linearity (typically a Rectified Linear Unit, or ReLU [ED]) and an optional
pooling stage before being fed to the next layer. Both the standard AlexNet [ED] and recent
improved architectures like VGGNet [E3] use a stack of fully connected layers to transform
the output activation map of the convolutional layers into class-membership probabilities. A
fully connected layer consists in a simple matrix-vector multiplication followed by a
non-linearity, typically ReLU for intermediate layers and a SoftMax for the last one.

Let $x_k$ be the output of layer $k$, which is also the input of layer $k+1$ (for AlexNet, $x_5$
is the flattened activation map of the fifth convolutional layer). Layer $k$ is parametrized by
the 4D tensor $C_k$ if it is a convolutional layer, and by the matrix $W_k$ if it is a fully connected
layer. A fully connected layer performs the operation $x_k = \sigma(W_k^\top x_{k-1})$, where $\sigma$ is the
non-linearity. We denote by $y_k = W_k^\top x_{k-1}$ the output of layer $k$ before the non-linearity, and by
$\theta = \{C_1, \ldots, C_M\} \cup \{W_{M+1}, \ldots, W_L\}$ the parameters of all $L$ layers of the network. Training
such deep models consists in end-to-end learning of this vast number of parameters via the
minimization of an error (or loss) function on a large training set of $N$ image and ground-truth
label pairs $(I^i, g^i)$. The typical loss function used for classification is the cross-entropy:

$$E(I^i, g^i; \theta) = -\sum_{j=1}^{P} g^i_j \log(x^i_{L,j}), \tag{4}$$

where $P$ is the number of labels (categories), $g^i \in \{0,1\}^P$ is the label vector of image $I^i$, and
$x^i_{L,j}$ is the predicted probability of class $j$ for image $I^i$ resulting from the forward pass.
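
The forward pass through the fully connected layers and the cross-entropy of equation (4) can be sketched in a few lines of NumPy. This is only a minimal illustration under simplifying assumptions: the convolutional layers are omitted, biases are ignored, and the function names and toy layer sizes are hypothetical.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def softmax(y):
    z = np.exp(y - np.max(y))  # subtract the max for numerical stability
    return z / z.sum()

def forward(x, weights):
    """Apply x_k = sigma(W_k^T x_{k-1}) for each fully connected layer.

    ReLU is used for intermediate layers and SoftMax for the last one.
    Returns the list of layer outputs x_k (starting with the input)
    and the list of pre-activations y_k."""
    xs, ys = [x], []
    for k, W in enumerate(weights):
        y = W.T @ xs[-1]
        ys.append(y)
        is_last = k == len(weights) - 1
        xs.append(softmax(y) if is_last else relu(y))
    return xs, ys

def cross_entropy(probs, g):
    """Equation (4): E = -sum_j g_j log(x_{L,j})."""
    return float(-np.sum(g * np.log(probs)))

# Toy network with two fully connected layers (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
toy_weights = [rng.standard_normal((5, 4)), rng.standard_normal((4, 3))]
toy_xs, toy_ys = forward(rng.standard_normal(5), toy_weights)
toy_loss = cross_entropy(toy_xs[-1], np.array([0.0, 1.0, 0.0]))
```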

The optimal network parameters $\theta^*$ are the ones minimizing the loss over the training set:

$$\theta^* = \arg\min_\theta \sum_{i=1}^{N} E(I^i, g^i; \theta). \tag{5}$$

This optimization problem is typically solved using Stochastic Gradient Descent (SGD) [□],
a stochastic approximation of batch gradient descent that takes approximate gradient
steps equal on average to the true gradient $\nabla_\theta E$. Each approximate gradient step is
typically performed with a small batch of labeled examples in order to efficiently leverage
the caching and vectorization mechanisms of modern hardware.

A particularity of deep networks is that the gradients with respect to all parameters $\theta$
can be computed efficiently in a stage-wise fashion via a sequential application of the chain
rule (“back-propagation” [El]). In particular, when using ConvNets as feature extractors,
the first phase consists in pre-training the network (i.e. obtaining $\theta^*$) via SGD with
back-propagation on a large labeled dataset like ImageNet [ED]. Then, ConvNet features can be
used for different tasks using forward passes on the pre-trained network. In the following, we
describe how we can also use back-propagation at test time to transfer richer Fisher Vector-like
representations based on the gradient of the loss with respect to the ConvNet parameters.


4 Gradient Features from Deep Nets 

We now motivate the use of gradient features from deep nets by relating the likelihood ratio
kernel in equation (3) to the ConvNet objective function in equation (4). We then make the
gradient equations explicit and relate our gradient features to the standard heuristic features derived
from the outputs of intermediate layers. Finally, we explain how to efficiently compute the
similarity between these high-dimensional representations.


4.1 Relating the likelihood ratio kernel and deep nets 

The FK [O] and its extensions [O, EB] were proposed as generic frameworks to derive
representations and kernels from generative models. As the standard ConvNet classification
architecture does not define a generative model, such frameworks cannot be applied as-is.
However, we can draw inspiration from the likelihood ratio kernel for that purpose. We start
from equation (3) and note that it can be rewritten as the gradient of the log of the
ratio between posterior probabilities (assuming equal class priors), i.e.:

$$\varphi_\theta^{LR}(X) = \nabla_\theta \log \frac{p(X|c_1)}{p(X|c_2)} = \nabla_\theta \log \frac{p(c_1|X)}{p(c_2|X)}. \tag{6}$$

In the two-class problem of [EE], we have $p(c_2|X) = 1 - p(c_1|X)$ and equation (6) gives:

$$\varphi_\theta^{LR}(X) = \frac{\varphi_\theta^{c_1}(X)}{1 - p(c_1|X)}, \tag{7}$$

where $\varphi_\theta^{c_j}(X) = \nabla_\theta \log p(c_j|X)$ is the gradient of the log-posterior for class $c_j$. We underline
that the previous formula is general in the sense that it can be applied beyond generative models.
To extend this representation beyond the two-class case, one may compute an embedding
$\varphi_\theta^{c_j}$ for each class $j$ using the gradient of the corresponding log-posterior probability.

We can now observe the relation between the ConvNet objective $E$ in equation (4) for an
image $I$ and label vector $g$ and these gradient-of-log-posterior embeddings:

$$\nabla_\theta E(I, g; \theta) = -\sum_{j=1}^{P} g_j \nabla_\theta \log p(c_j|I) = -\sum_{j=1}^{P} g_j\, \varphi_\theta^{c_j}(I). \tag{8}$$

Consequently, the gradient of the ConvNet objective can be interpreted as a sum of gradient
embeddings $\varphi_\theta^{c_j}(I)$, weighted by the labels $g_j$.
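
Equation (8) can be checked numerically on a toy case. The sketch below treats the SoftMax inputs (logits) as the only parameters, which is an assumption made to keep the example short; in that setting the gradient of $\log p(c_j|I)$ is the one-hot vector $e_j$ minus the predicted probabilities, and the label-weighted sum reduces to the familiar "probabilities minus labels" form. All names and values are illustrative.

```python
import numpy as np

def softmax(y):
    z = np.exp(y - np.max(y))  # subtract the max for numerical stability
    return z / z.sum()

def grad_log_posterior(y, j):
    """Gradient of log p(c_j | I) with respect to the logits y: e_j - softmax(y)."""
    e_j = np.zeros_like(y)
    e_j[j] = 1.0
    return e_j - softmax(y)

def grad_loss(y, g):
    """Equation (8) restricted to the logits: minus the label-weighted sum
    of the per-class log-posterior gradients."""
    return -sum(g[j] * grad_log_posterior(y, j) for j in range(len(y)))

logits = np.array([2.0, -1.0, 0.5])
labels = np.array([0.0, 1.0, 0.0])        # any label vector summing to one
weighted_sum = grad_loss(logits, labels)
closed_form = softmax(logits) - labels    # standard cross-entropy gradient w.r.t. logits
```

Both `weighted_sum` and `closed_form` hold the same vector, which is the quantity that back-propagation pushes down to the fully connected layers.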

To use this gradient as an image representation, as is the case for the FK, there are two
main challenges to be addressed. First, we do not have access to the value of the label $g$,
which we need to compute the representation of a test image $I$ according to equation (8). The
simplest solution consists in using a constant uniform label vector $g = \bar{g} = [1/P, \ldots, 1/P]$.
Although $\bar{g}$ is non-informative, we experimentally validate the interest of this simple strategy.
The second issue concerning the use of $\nabla_\theta E(I, \bar{g}; \theta)$ as an image representation lies in
the associated computational cost. Although scalable in the number of classes, this representation
is very high-dimensional. The number of parameters $\theta$ in current deep architectures
is indeed too large to be able to use the full gradient $\nabla_\theta E$ in practice. Therefore, we propose
to use only the partial derivatives with respect to the parameters of some fixed layers, in the
same spirit as what is currently done with layer-activation features. These partial derivatives
can be computed and compared efficiently using the chain rule and a rank-1 decomposition,
as shown in the following sections. Note also that this approach can be further combined
with other existing techniques, including ones specialized for deep nets (e.g. model compression [i])
or for FV (e.g. product quantization [O]).


4.2 Gradient derivation 


One remarkable property of ConvNets and other feed-forward architectures is that they are
differentiable through all their layers. In the case of ConvNets, it is easy to show that the
gradients of the loss with respect to the weights of the fully connected layers are:

$$\frac{\partial E}{\partial W_k} = x_{k-1} \left[\frac{\partial E}{\partial y_k}\right]^\top. \tag{9}$$

To compute the partial derivatives of the loss with respect to the layer outputs $y_k$ needed
in equation (9), one can apply the chain rule. In the case of fully connected layers and ReLU
non-linearities, this leads to the following recursive definition and base case:

$$\frac{\partial E}{\partial y_k} = \left[W_{k+1} \frac{\partial E}{\partial y_{k+1}}\right] \circ \mathbb{1}[y_k > 0], \qquad \frac{\partial E}{\partial y_L} = \sigma(y_L) - g, \tag{10}$$
where $\mathbb{1}[y_k > 0]$ is an indicator vector, set to one at the positions where $y_k > 0$ and to zero
otherwise, $\circ$ is the Hadamard or element-wise product, $g$ is a supplied vector of labels with
which to compute the loss, and $\sigma$ is the SoftMax function. From the previous section, we
use $g = \bar{g} = [1/P, \ldots, 1/P]$, i.e. we assume that all classes have equal probabilities. It is worth
noticing how $\partial E / \partial y_L$ is simply a shifted version of the output probabilities, while the derivatives
w.r.t. $y_k$ with $k < L$ are linear transformations of these shifted probabilities, as the Hadamard
product can be rewritten as a matrix multiplication.
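
The backward recursion can be sketched as follows for a stack of fully connected layers with ReLU non-linearities, a SoftMax output and uniform labels. This is a minimal NumPy illustration rather than the implementation used in the experiments: biases and convolutional layers are left out, and the function names are hypothetical.

```python
import numpy as np

def softmax(y):
    z = np.exp(y - np.max(y))  # subtract the max for numerical stability
    return z / z.sum()

def forward(x0, weights):
    """Forward pass x_k = sigma(W_k^T x_{k-1}): ReLU inside, SoftMax on top."""
    xs, ys = [x0], []
    for k, W in enumerate(weights):
        y = W.T @ xs[-1]
        ys.append(y)
        is_last = k == len(weights) - 1
        xs.append(softmax(y) if is_last else np.maximum(y, 0.0))
    return xs, ys

def uniform_label_loss(x0, weights):
    """Cross-entropy between the predicted probabilities and uniform labels."""
    probs = forward(x0, weights)[0][-1]
    return float(-np.mean(np.log(probs)))

def backward_features(x0, weights):
    """Return, for every layer, the pair (layer input, dE/dy) whose outer
    product is the gradient of the loss with respect to that layer's weights."""
    xs, ys = forward(x0, weights)
    num_classes = ys[-1].shape[0]
    g_bar = np.full(num_classes, 1.0 / num_classes)
    dys = [None] * len(weights)
    dys[-1] = xs[-1] - g_bar                      # base case: probabilities minus labels
    for k in range(len(weights) - 2, -1, -1):     # chain rule, from the top layer down
        dys[k] = (weights[k + 1] @ dys[k + 1]) * (ys[k] > 0)
    return list(zip(xs[:-1], dys))
```

For a given layer, `np.outer(x_prev, dy)` would materialize the full gradient matrix; the point of the next section is that this matrix never needs to be formed.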


4.3 Computing similarities between gradients 

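Using the gradients in equation (9) as features is problematic in practice due to their
high-dimensional nature with current deep architectures. In the case of AlexNet [ED], $\partial E / \partial W_8$ is
around 4 million floating point values, while $\partial E / \partial W_7$ and $\partial E / \partial W_6$ are around 16 and 36 million
floats, respectively. Thus, explicitly computing the dot-product between the gradients is impractical.
Instead, we propose to take advantage of the unique structure of our gradients (rank-1 matrices,
cf. Eq. (9)) by using the trace kernel, defined for two matrices $A$ and $B$ as:

$$K_{tr}(A, B) = \mathrm{Tr}(A^\top B). \tag{11}$$

For rank-1 matrices, the trace kernel can be decomposed as the product of two kernels. If we let
$A = au^\top$ and $B = bv^\top$, with $a, b \in \mathbb{R}^d$ and $u, v \in \mathbb{R}^D$ (so that $A, B \in \mathbb{R}^{d \times D}$), then,
using $\mathrm{Tr}(A^\top B) = \mathrm{Tr}(A B^\top)$:

$$K_{tr}(A, B) = \mathrm{Tr}(au^\top (bv^\top)^\top) = \mathrm{Tr}(au^\top v b^\top) = \mathrm{Tr}(b^\top a\, u^\top v) = (b^\top a) \cdot (u^\top v).$$

Therefore, for two images $A$ and $B$, we can compute the similarity between gradients in a
low-dimensional space without explicitly computing the gradients w.r.t. the weights $W_k$:

$$K_{tr}\!\left(\frac{\partial E}{\partial W_k}(A), \frac{\partial E}{\partial W_k}(B)\right) = \left(x_{k-1}(A)^\top x_{k-1}(B)\right) \cdot \left(\left[\frac{\partial E}{\partial y_k}(A)\right]^\top \frac{\partial E}{\partial y_k}(B)\right). \tag{12}$$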


The left part of this equation indicates that the forward activations of the two inputs should
be similar. This is the standard measure of similarity which is used between images when
described by the outputs of the intermediate layers of ConvNets. However, this similarity
is multiplicatively weighted by the similarity between the back-propagated quantities. This
indicates that, to obtain a high value with the proposed kernel, both the target forward activations
and the back-propagated quantities of the images need to be similar.
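
The rank-1 shortcut is easy to verify numerically. The sketch below compares the trace kernel computed on explicitly materialized rank-1 matrices with the product of two dot products; here `a` and `b` stand for the forward features of two images and `u` and `v` for their back-propagated quantities. The function names and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def trace_kernel_explicit(a, u, b, v):
    """Tr(A^T B) computed by materializing A = a u^T and B = b v^T."""
    A, B = np.outer(a, u), np.outer(b, v)
    return float(np.trace(A.T @ B))

def trace_kernel_rank1(a, u, b, v):
    """Same value as a product of two dot products, without forming A or B."""
    return float((a @ b) * (u @ v))

rng = np.random.default_rng(0)
a, b = rng.standard_normal(8), rng.standard_normal(8)   # forward features
u, v = rng.standard_normal(5), rng.standard_normal(5)   # backward features
k_explicit = trace_kernel_explicit(a, u, b, v)
k_fast = trace_kernel_rank1(a, u, b, v)
```

The explicit route costs memory and time proportional to the size of the weight matrix, whereas the factorized one only touches the two feature vectors of each image.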

Normalization. The $\ell_2$-normalization of the activation features consistently leads to superior
results [0]. In our experiments we $\ell_2$-normalize our forward and backward features
independently. This is consistent with normalizing the gradient matrix using the Frobenius
norm, since $\|au^\top\|_F = \|a\|_2 \|u\|_2$.
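
For completeness, the identity behind this remark follows directly from the trace decomposition used above, written here for a generic rank-1 matrix $au^\top$ (with $bv^\top$ a second such matrix):

```latex
\begin{aligned}
\|a u^\top\|_F^2 &= \mathrm{Tr}\big((a u^\top)^\top a u^\top\big)
                  = \mathrm{Tr}\big(u\, a^\top a\, u^\top\big)
                  = (a^\top a)\,(u^\top u)
                  = \|a\|_2^2\, \|u\|_2^2, \\
K_{tr}\!\left(\frac{a u^\top}{\|a u^\top\|_F}, \frac{b v^\top}{\|b v^\top\|_F}\right)
                 &= \frac{a^\top b}{\|a\|_2 \|b\|_2} \cdot \frac{u^\top v}{\|u\|_2 \|v\|_2}.
\end{aligned}
```

In other words, the trace kernel between Frobenius-normalized gradients is the product of the cosine similarity between forward features and the cosine similarity between backward features.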


5 Experimental results 

5.1 Datasets and evaluation protocols 

We evaluate our approach to transfer features from pretrained models on two standard classification
benchmarks, Pascal VOC 2007 and Pascal VOC 2012 [O]. These datasets contain
9,963 and 22,531 annotated images, respectively. Each image is annotated with one or more
labels corresponding to 20 object categories. The datasets include partitions for training,
validation, and testing, and the accuracy is measured in terms of per-class mean average
precision (mAP). The test annotations of VOC 2012 are not public, but an evaluation server
with a limited number of submissions per week is available. Therefore, we use the validation
set for the first part of our analysis on the VOC 2012 dataset, and evaluate on the test set only
for the final experiments. We conduct all VOC 2007 experiments on the full dataset.


5.2 Implementation details 

We tested our approach on two different deep ConvNets: AlexNet [ED] and VGG16 [E3].
VGG16 is a much deeper architecture than AlexNet, with many more convolutional layers,
leading to superior performance, but also to much slower training and feature extraction.
We used the pre-trained networks that are publicly available¹. Both networks were pre-trained
on the ILSVRC2012 subset of ImageNet, which is disjoint from the Pascal VOC
datasets, and therefore suitable for our evaluation of feature transfer.

To extract descriptors from the Pascal images, we first resize the images so that the
shortest side has 227 pixels (224 in the VGG16 case), and then take the central square
crop, without distorting the aspect ratio. We found this cropping technique to work well in
practice. For simplicity, we do not use data augmentation. The feature extraction is performed on
a customized version of the Caffe library², modified to expose the back-propagation features.
This allows us to extract forward and backward features of the training and testing images.
At testing time we use a tempered version of SoftMax, $\sigma(y, \tau) = \exp(y/\tau) / \sum_i \exp(y_i/\tau)$,
with $\tau = 2$, to produce softer probability distributions for back-propagation. As discussed
in section 4.1, we use non-informative uniform labels for the backward pass to extract the
gradient features. All forward and backward features are then $\ell_2$-normalized.
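
The tempered SoftMax and the normalization step can be written compactly. The sketch below is a NumPy illustration with hypothetical function names and toy scores; it is not the Caffe code used for the experiments.

```python
import numpy as np

def tempered_softmax(y, tau=2.0):
    """SoftMax with temperature tau: exp(y / tau) / sum_i exp(y_i / tau)."""
    z = np.asarray(y, dtype=float) / tau
    z = np.exp(z - z.max())          # subtract the max for numerical stability
    return z / z.sum()

def l2_normalize(v, eps=1e-12):
    """Scale v to unit Euclidean norm (eps guards against an all-zero vector)."""
    return v / max(np.linalg.norm(v), eps)

scores = np.array([4.0, 1.0, 0.0])
sharp = tempered_softmax(scores, tau=1.0)   # standard SoftMax
soft = tempered_softmax(scores, tau=2.0)    # flatter distribution
```

A larger temperature spreads probability mass over more classes, which keeps the back-propagated signal from vanishing when the network is very confident about a single class.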

To perform classification, we use the SVM implementation of scikit-learn [U3]³. The cost
parameter C of the solver was set to the default value of 1, which worked well in practice.
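
One way to feed the trace kernel to an SVM is through a precomputed Gram matrix, obtained as the element-wise product of the forward and backward Gram matrices. The sketch below shows that construction in NumPy; the function names are hypothetical, and the use of a precomputed kernel is an assumption about the setup rather than a description of the exact pipeline.

```python
import numpy as np

def l2_normalize_rows(X, eps=1e-12):
    """Scale every row of X to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def gradient_gram_matrix(fwd_a, bwd_a, fwd_b, bwd_b):
    """Trace kernel between two sets of images.

    fwd_*: forward features, one row per image.
    bwd_*: back-propagated features, one row per image.
    Entry (i, j) is the forward similarity times the backward similarity."""
    fa, fb = l2_normalize_rows(fwd_a), l2_normalize_rows(fwd_b)
    ba, bb = l2_normalize_rows(bwd_a), l2_normalize_rows(bwd_b)
    return (fa @ fb.T) * (ba @ bb.T)

# Toy features for four images (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
fwd = rng.standard_normal((4, 5))
bwd = rng.standard_normal((4, 3))
K_train = gradient_gram_matrix(fwd, bwd, fwd, bwd)
```

A matrix such as `K_train` could then be passed to a kernel SVM that accepts precomputed Gram matrices, for example scikit-learn's `SVC(kernel="precomputed", C=1.0)`, with the test-versus-train matrix computed by the same function at prediction time.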

5.3 Results and discussion 

Table 1 summarizes the results, and compares our approach with the state of the art and different
baselines. We extract and compare several features for each dataset and network architecture:
(i) individual forward activation features, from Pool5 up to the probability layer; (ii)
concatenation of forward activation features, e.g. Pool5+FC6, FC6+FC7, FC7+FC8; (iii) our
proposed gradient features: $\partial E / \partial W_6$, $\partial E / \partial W_7$, and $\partial E / \partial W_8$. The similarity between
$\ell_2$-normalized forward activation features is measured with the dot-product, while the similarity
between gradient representations is measured using the trace kernel. We highlight the following points.

Forward activations. In all cases, FC7 is the best performing individual layer on both
VOC2007 and VOC2012, independently of the network. This is consistent with previous
findings. Also consistent is the fact that the probability layer performs badly in this case.
More surprisingly, concatenating forward layers does not seem to bring any noticeable accuracy
improvements in any setup.

¹ https://github.com/BVLC/caffe/wiki/Model-Zoo
² http://caffe.berkeleyvision.org
³ http://scikit-learn.org/

Table 1: Left: Results on Pascal VOC 2007 and VOC 2012 with AlexNet (A) and VGG16 (V). Results
on VOC 2012 are on the validation set. Right: Comparison with other ConvNet results (mAP in %).

Left:

                                   VOC2007          VOC2012
Features                           (A)     (V)      (A)     (V)
x5 (Pool5)                         71.0    86.7     66.1    81.4
x6 (FC6)                           77.1    89.3     72.6    84.4
x7 (FC7)                           79.4    89.4     74.9    84.6
y8 (FC8)                           79.1    88.3     74.3    84.1
x8 (Prob)                          76.2    86.0     71.9    81.3
[x5; x6]                           76.4    89.2     71.6    84.0
∂E/∂W6 = x5 [∂E/∂y6]^T             80.2    89.3     75.1    84.6
[x6; x7]                           79.1    89.5     74.3    84.6
∂E/∂W7 = x6 [∂E/∂y7]^T             80.9    90.0     76.3    85.2
[x7; y8]                           79.7    89.2     75.3    84.6
∂E/∂W8 = x7 [∂E/∂y8]^T             79.7    88.2     75.0    83.4

Right:

Method                             VOC'07   VOC'12
Proposed - AlexNet [ED]            80.9     76.5
Proposed - VGG16 [E3]              90.0     85.3
DeCAF [0] from [B]                 73.4     -
Razavian et al. [im]               77.2     -
Oquab et al. [E3]                  77.7     78.7
Zeiler et al. [m\]                 -        79.0
Chatfield et al. [□]               82.4     83.2
He et al. [im]                     80.1     -
Wei et al. [O]                     81.5     81.7
Simonyan et al. [IB]               89.7     89.3

Gradient representations. We compare the gradient representations with the concatenation
of forward activations, since they are closely related and share part of the features. On
the deeper layers (6 and 7) the gradient representations outperform the individual features
as well as the concatenation, both for AlexNet and VGG16, on both datasets. For AlexNet,
the improvements are quite significant: +3.8% and +3.5% absolute improvement for the
gradients with respect to $W_6$ on VOC2007 and VOC2012, and +1.8% and +2% for $W_7$. The
improvements for VGG16 are more modest but still noticeable: +0.1% and +0.6% for the
gradients with respect to $W_6$, and +0.5% and +0.6% for the gradients with respect to $W_7$.
Larger relative improvements on less discriminative networks such as AlexNet seem to suggest
that the more complex gradient representation can, to some extent, compensate for the
lack of discriminative power of the network, but that one obtains diminishing returns as the
power of the network increases. Once one reaches the top of the network (FC8), the gradient
representations perform worse and these improvements diminish or disappear completely.
This is expected, as the derivative with respect to $W_8$ depends heavily on the output of the
probability layer, which is known to saturate. However, for the derivatives with respect to
$W_6$ and $W_7$, more information is involved, leading to superior results.

Comparison with other works. Our best results are compared with the state-of-the-art
on PASCAL VOC2007 and VOC2012 in Table 1. We can see that we obtain competitive
performance on both datasets. We note however that our results with VGG16 are somewhat
inferior to those reported in [E3] with a similar model. We believe this might be explained
by the more costly feature extraction strategy employed by Simonyan and Zisserman, which
involves aggregating image descriptors at multiple scales.

Per-class results. We report per-class results for Pascal VOC2007 in Table 2 and for
VOC2012 in Table 3. We compare the best forward features (individual FC7) with the best
gradient representation ($\partial E / \partial W_7$). The results on VOC2007 are on the test set. For VOC2012,
we report results both on validation and on test. We observe that, on both networks and
datasets, the gradient representation is consistently better even when the improvements are not large. For
AlexNet, the gradient representation has the best performance on 18 out of the 20 classes
on VOC2007, and on all classes for VOC2012. For VGG, the gradient representation is the
best one on 17 out of the 20 classes both on VOC2007 and VOC2012 (validation). The
differences between validation and test on VOC2012 are minimal.
Table 2: Results on Pascal VOC2007 with AlexNet and VGG16. Comparison between the
standard forward activation features and the proposed gradient features. Each row lists the
features, the mAP, and the 20 per-class average precisions.

AlexNet
Features    mAP  | per-class AP
x7 (FC7)    79.4 | 95.4 88.6 92.6 87.3 42.1 80.1 90.5 89.6 59.9 68.2 74.1 85.3 89.8 85.6 95.3 58.1 78.9 57.9 94.7 74.4
∂E/∂W7      80.9 | 96.6 89.2 93.8 89.5 44.9 81.0 91.9 89.9 61.2 70.4 78.5 86.2 91.4 87.4 95.7 60.5 78.8 62.5 95.2 73.5

VGG16
Features    mAP  | per-class AP
x7 (FC7)    89.3 | 99.2 95.9 99.1 96.9 63.8 92.8 95.1 98.1 70.4 87.8 84.3 97.0 97.2 93.5 97.3 68.6 92.2 73.3 98.7 85.5
∂E/∂W7      90.0 | 99.6 97.2 98.8 97.0 63.3 93.8 95.6 98.4 71.1 89.4 85.3 97.7 97.7 95.6 97.5 70.3 92.7 76.2 98.8 84.2


Table 3: Results on Pascal VOC2012 with AlexNet and VGG16. Comparison between the
standard forward activation features and the proposed gradient features. Each row lists the
features, the mAP, and the per-class average precisions.

AlexNet (evaluated on the validation set)
Features    mAP  | per-class AP
x7 (FC7)    74.9 | 92.9 75.4 88.7 81.7 48.0 89.0 70.3 88.0 62.3 63.6 57.8 83.5 78.0 82.9 92.9 49.1 74.8 50.5 90.2 78.7
∂E/∂W7      76.3 | 94.3 77.4 89.5 82.2 50.8 90.2 72.4 89.3 64.8 63.9 60.3 84.0 79.6 84.0 93.2 50.6 76.7 52.6 91.8 79.2

AlexNet (evaluated on the test set)
Features    mAP  | per-class AP
x7 (FC7)    75.0 | 93.8 75.0 86.4 82.2 48.2 82.5 73.8 87.6 63.8 63.5 69.3 85.7 80.3 84.1 92.3 47.4 72.2 51.8 … 72.5
∂E/∂W7      76.5 | 95.0 76.6 87.7 82.9 52.5 83.4 75.6 88.6 65.3 65.4 69.8 86.5 82.1 85.1 93.0 48.2 74.5 57.0 … 73.0

VGG16 (evaluated on the validation set)
Features    mAP  | per-class AP
x7 (FC7)    84.6 | 98.2 88.3 94.6 90.5 66.0 93.6 80.5 96.4 73.9 81.3 70.2 93.0 91.3 91.3 95.1 56.3 87.7 64.2 95.8 84.5
∂E/∂W7      85.2 | 98.6 89.4 94.7 91.5 67.2 94.0 80.9 96.8 73.7 83.7 71.9 93.4 91.6 91.5 95.4 56.0 88.3 65.2 95.5 85.2

VGG16 (evaluated on the test set)
Features    mAP  | per-class AP
x7 (FC7)    85.0 | 97.8 85.2 92.3 91.1 64.5 89.7 82.2 95.4 74.1 84.7 81.1 94.1 93.5 91.9 95.0 57.9 86.0 67.8 95.2 81.5
∂E/∂W7      85.3 | 98.0 86.0 91.7 91.3 65.7 89.6 82.4 95.5 74.5 84.2 80.7 94.3 93.7 92.2 95.4 57.7 87.2 69.2 95.2 81.4


6 Conclusions 


In this paper we show a link between ConvNets as feature extractors and Fisher Vector
encodings. We have introduced a gradient-based representation for features extracted with
ConvNets, inspired by the Fisher Kernel framework. This representation takes advantage of
the high-quality features learned by ConvNets in an end-to-end supervised manner, and of
the discriminative power of gradient-based representations. We also presented an approach to
compute similarities between gradients efficiently, without explicitly computing
the high-dimensional gradient representations. We show that this similarity can be seen as
a weighted version of the forward feature similarities that takes into account not only the
features themselves, but also information back-propagated from the ConvNet objective. We
tested our approach on the Pascal VOC2007 and VOC2012 benchmarks using two different
popular deep architectures, showing consistent improvements over using only the individual
forward activation features or their combination, as is standard practice.